Understanding Inefficiencies in Data-Intensive Computing
Authors
Abstract
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how such systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple model of I/O resource consumption that predicts the ideal lower-bound runtime of a parallel dataflow on a particular set of hardware. Comparing actual system performance to the model’s ideal prediction exposes the inefficiency of a scale-out system. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model’s ideal can be approached (i.e., that it is not wildly optimistic), but that a gap of up to 20% remains for workloads using up to 45 nodes. Guided by the model, we analyze inefficiencies exposed in both the disk and networking subsystems—issues that will be faced by any DISC system built atop popular commodity hardware and OSs.

Acknowledgements: We thank the members and companies of the PDL Consortium (including APC, EMC, Facebook, Google, Hewlett-Packard Labs, Hitachi, IBM, Intel, LSI, Microsoft Research, NEC Laboratories, NetApp, Oracle, Riverbed, Samsung, Seagate, STEC, Symantec, VMware, and Yahoo! Labs) for their interest, insights, feedback, and support. This research was sponsored in part by an HP Innovation Research Award and by CyLab at Carnegie Mellon University under grant DAAD19–02–1–0389 from the Army Research Office.
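The abstract's core idea — comparing measured runtime against an ideal lower bound derived from hardware capabilities — can be illustrated with a minimal bottleneck-style model. The function below is a hypothetical sketch, not the paper's actual model; all parameter names and values are illustrative assumptions.

```python
# Hypothetical sketch of a bottleneck-style I/O model: the ideal runtime of a
# parallel dataflow phase is bounded below by the time its most-loaded
# resource (disk or network) needs to move the phase's data.

def ideal_phase_runtime(bytes_read, bytes_written, bytes_shuffled,
                        nodes, disk_bw, net_bw):
    """Lower-bound runtime (seconds) of one dataflow phase on `nodes`
    machines, each with `disk_bw` and `net_bw` in bytes/second."""
    disk_time = (bytes_read + bytes_written) / (nodes * disk_bw)
    net_time = bytes_shuffled / (nodes * net_bw)
    # The phase cannot finish faster than its slowest resource allows.
    return max(disk_time, net_time)

# Illustrative example: process 1 TB on 45 nodes, each with a 100 MB/s disk
# and a ~125 MB/s (1 Gb/s) network; data is read, shuffled, and written once.
tb = 10**12
t = ideal_phase_runtime(bytes_read=tb, bytes_written=tb,
                        bytes_shuffled=tb, nodes=45,
                        disk_bw=100e6, net_bw=125e6)
```

Measuring a real system's runtime for the same workload and dividing by this bound gives an efficiency figure; the paper reports that a simplified tool (Parallel DataSeries) can come within about 20% of such an ideal.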
Similar Resources
Data Replication-Based Scheduling in Cloud Computing Environment
Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grids, cloud computing provides these factors in a more affordable, scalable, and elastic platform. Furthermore, accessing data files is critical for executing such applications. Sometimes accessing data becomes...
Applying Simple Performance Models to Understand Inefficiencies in Data-Intensive Computing
New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This paper describes a simple analytical model that predicts the theoretical ideal performance of a parallel dataflow s...
Supporting Large Scale Data-Intensive Computing with the FusionFS Distributed File System
State-of-the-art yet decades-old architecture of HPC storage systems has segregated compute and storage resources, bringing unprecedented inefficiencies and bottlenecks at petascale levels and beyond. This paper presents FusionFS, a new distributed file system designed from the ground up for high scalability (16K nodes) while achieving significantly higher I/O performance (2.5TB/sec). FusionFS ...
Compiler and Runtime Supports for High-Performance, Scalable Big Data Systems
Big Data analytics applications such as social network analysis and web analysis have revolutionized modern computing. The processing demand posed by an unprecedented amount of data challenges both industrial practitioners and academic researchers to design and implement highly efficient and scalable system infrastructures. Unfortunately, Big Data processing is fundamentally limited by memory i...
A survey on impact of cloud computing security challenges on NFV infrastructure and risks mitigation solutions
Increased broadband data rates for end users, and the cost of provisioning resources to an agreed SLA, are forcing telecom operators to employ Virtual Network Functions (VNFs) in an NFV solution. The new 5G mobile telecom technology is also based on NFV and Software Defined Networking (SDN), which inherit the opportunities and threats of such constructs. Thus a ...